ABSTRACT

Disclosed is a five-step process for producing closed captions
for a television program, subtitles for a movie or other uses
for time-aligned transcripts. An operator transcribes the audio
track while listening to the recorded material. The system helps
him/her to work efficiently and produce precisely aligned
captions. The first step consists of identifying the portions of
the input audio that contain spoken text. Only the spoken parts
are further processed by the invention system. The other parts
may be used to generate non-spoken captions. The second step
controls the rate of speech depending on how fast the operator
types. While the operator types, the third module records the
time the words were typed in. This provides a rough time
alignment for the transcribed text. Then the fourth module
precisely realigns the transcribed text on the audio track. A
final module segments the transcribed text into captions, based
on acoustic clues and natural language constraints. Further, the
speech rate-control component of the system may be used in other
systems where transcripts are required to be generated from
spoken audio.

15 Claims, 6 Drawing Sheets 
The present invention provides a semi-automatic method for
producing closed captions or, more generally, time-aligned
transcriptions from an audio track. In a preferred embodiment as
illustrated in FIG. 1, the invention system 11/method is a
five-step process and requires an operator to transcribe the
audio being played. The system 11 helps the operator to work
efficiently and automates some of the tasks, like segmentation
of the captions and their alignment along time. A brief
introduction to each step (also referred to herein as a software
"module") as illustrated in FIG. 1 is presented next, followed
by a detailed description of each in the preferred embodiment.
It is understood that these steps/modules are performed by a
digital processor in a computer system having appropriate
working memory, cache storage and the like as made apparent by
the functional details below.

A first module, the audio classifier 15, sorts the input audio
13 into different categories: spoken text, music, etc. Of
interest are the spoken parts of the input audio 13 track
because the spoken parts need to be transcribed. Possibly, a
particular noise or sound other than spoken language may need to
be captioned. However, only the spoken parts 17 as sorted or
filtered by the audio classifier 15 are sent to the next module
19.

The next module, the speech rate-control module 19, controls the
rate of speech depending on how fast the text is spoken and/or
how fast the operator 53 types. This module ensures that the
spoken text remains understandable by maintaining a constant
pitch. The audio produced 21 is time-stamped since a
time-dependent transformation has been applied to the audio
samples. The time stamps allow the next module 23 to use the
proper time scale. The speech-rate control module 19 preferably
uses speech recognition techniques at the phoneme level.

The third module, the time event tracker 23, receives the
time-stamped audio 21 and records the time the words were typed
in by the operator 53. This provides a rough time alignment of
the corresponding text 25 that will be precisely realigned by
the next module 29. The recorded time events are mapped back to
the original time scale. Thus the time event tracker 23 produces
as output roughly aligned transcription text 27.

The fourth module 29 receives the roughly aligned text 27 
and realigns precisely the text on the audio track 13 using 
speech recognition techniques at the word level. 

Realigner 29 thus outputs aligned transcribed text 31. 

Finally, the closed caption segmenter 33 breaks the aligned
transcribed text 31 into captions, similar to a sentence for
written text, based on acoustic and other clues.

To that end, closed caption segmenter 33 produces the 
desired closed captions 35. 

Turning now to the particular details of each of the above
modules as implemented in a preferred embodiment, reference is
made to FIGS. 2-4 in the discussion below.
Audio Classifier Module 15 

Before playing the audio input 13 to the operator 53, the audio
classifier 15 segments or otherwise sorts the audio input 13
into working parts that contain spoken words. The audio
classifier 15 also identifies parts that contain other sounds of
interest (like a barking dog, music inserts or a train passing
by) for the purposes of non-speech closed captioning 71. Thus
audio classifier 15 determines and separates the audio portions
containing spoken words and the audio portions containing
non-speech sounds needing transcribing. Closed captions for the
latter are produced at 71 while closed captions for the spoken
words/speech audio are produced by the rest of system 11. In
summary, module 15 enables the operator 53 to concentrate only
on the spoken word parts that need to be transcribed.

This approach is known in the literature as "audio
classification". Numerous techniques may be used. For instance,
an HMM (Hidden Markov Model) or neural net system may be trained
to recognize broad classes of audio including silence, music,
particular sounds, and spoken words. The output audio speech 17
is a sequence of segments, where each segment is a piece of the
source audio track 13 labeled with the class it belongs to. An
example of a speech segmentation system is given in "Segment
Generation and Clustering in the HTK Broadcast News
Transcription System," by T. Hain et al., Proc. DARPA Broadcast
News Transcription and Understanding Workshop, 1998. Other audio
classifier systems are suitable.
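
As an illustration only, the labeled-segment output described above
might be represented and filtered as in the following Python sketch.
The `Segment` type, the label strings and the `split_by_class` helper
are hypothetical names chosen for this example; the classifier itself
(HMM or neural net) is not shown.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the source audio track
    end: float
    label: str    # assumed label set: "speech", "music", "sound", "silence"


def split_by_class(segments):
    """Separate spoken segments from other labeled sounds of interest."""
    speech = [s for s in segments if s.label == "speech"]
    non_speech = [s for s in segments if s.label in ("music", "sound")]
    return speech, non_speech


# Hypothetical classifier output for a 30-second clip.
segments = [
    Segment(0.0, 4.2, "music"),
    Segment(4.2, 19.5, "speech"),
    Segment(19.5, 21.0, "silence"),
    Segment(21.0, 30.0, "speech"),
]
speech, non_speech = split_by_class(segments)
```

Here `speech` would feed the rate-control module, while `non_speech`
would be candidates for non-speech captions; silence is dropped.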

Note that this module 15 can eventually be integrated with the
speech rate control module 19 since the speech rate control
module 19 already performs phoneme recognition. In that case,
additional sound or general filler models can be added to the
phoneme models in order to capture non-speech audio.
Speech Rate Control Module 19 

This module 19 controls the speech playback rate based
on a count of speech units while playing back a recording 
(i.e., the filtered audio speech 17 output from audio classifier 
15). Speech units are typically phonemes. The speech rate 
control module 19 allows adjusting automatically the rate of 
spoken words from the audio 17 to a comfortable rate for the 
listener (transcriber operator 53). With reference to FIG. 2, 
speech recognizer 41 analyzes the audio speech 17 which is 
a recorded speech stream and produces a count of speech 
units for a given unit of time. A calculation unit of the 
recognizer 41 averages or windows this data over a larger 
unit of time to smooth the results and gives an estimate 39 
of the speech rate. A speech-playback-rate adjustment unit 
43 uses the computed speech rate estimate 39 to control the 
playback rate of subject audio speech 17. Speech-playback- 
rate adjustment unit 43 controls the playback rate to match 
a desired target rate 37 as output/determined by target rate 
calculation unit 45. The desired target speech rate 37 may be 
a predefined value or depend on an external synchronization, 
here the keyboard input (i.e., real time transcribed text) 49. 

As a result, speech rate control module 19 outputs rate- 
adjusted speech audio 47 at a desired rate for the transcriber 
operator 53 to listen to. In addition, the speech rate control 
module 19 produces a time-stamped audio 21 transformation 
of audio speech 17. 
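
One way to picture the time-stamped transformation is as a list of
anchor points pairing playback-clock times with original-recording
times. The sketch below is a minimal illustration under that
assumption; the anchor representation and the function name
`to_original_time` are hypothetical and not taken from the disclosure.

```python
from bisect import bisect_right


def to_original_time(anchors, playback_t):
    """Map a playback-clock time to the original time scale.

    anchors: at least two (playback_time, original_time) pairs, sorted by
    playback_time, assumed to be recorded whenever the playback rate changes.
    Between anchors the playback rate is treated as constant.
    """
    times = [p for p, _ in anchors]
    i = max(bisect_right(times, playback_t) - 1, 0)
    i = min(i, len(anchors) - 2)
    (p0, o0), (p1, o1) = anchors[i], anchors[i + 1]
    rate = (o1 - o0) / (p1 - p0)  # original seconds per playback second
    return o0 + (playback_t - p0) * rate


# First 10 s of playback ran at half speed, the next 10 s at 1.5x.
anchors = [(0.0, 0.0), (10.0, 5.0), (20.0, 20.0)]
early = to_original_time(anchors, 4.0)
late = to_original_time(anchors, 15.0)
```

With these anchors, a keystroke at playback time 15.0 s corresponds to
12.5 s in the original recording.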

One embodiment of the speech rate control module 19 is 
described in detail in Appendix I. 
Time Event Tracker Module 23 

This module 23 automatically links operator text (transcription)
input 25 with the time-stamped audio stream output from speech
rate control 19. This linking results in a rough alignment 27
between the transcript text and the original audio 13 or video
recording.

Preferably the module 23 tracks what the transcriber operator 53
has typed/input 25 and how fast the transcriber 53 is typing.
The module 23 automatically detects predefined trigger events
(i.e., first letter after a space), time stamps these events and
records time-stamped indices to the trigger events in a master
file in chronological order. Operator text input 25 is thus
linked to the speech rate control module 19 time-stamped audio
output stream 21 by the nearest-in-time trigger event recorded
for the audio stream 21 data.
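
The trigger-event bookkeeping described above can be sketched as
follows. This is an illustrative Python sketch only: the keystroke
format, the function names and the in-memory list standing in for the
master file are assumptions, and Appendix II remains the reference for
the actual embodiment.

```python
from bisect import bisect_left


def trigger_events(keystrokes):
    """Return (time, char_index) for each first letter typed after a space.

    keystrokes: list of (time_seconds, character) in typing order.
    """
    events = []
    prev = " "  # treat the start of input as if it followed a space
    for index, (t, ch) in enumerate(keystrokes):
        if prev.isspace() and not ch.isspace():
            events.append((t, index))
        prev = ch
    return events


def nearest_event(events, t):
    """Return the recorded trigger event closest in time to t.

    events must be non-empty and in chronological order.
    """
    times = [e[0] for e in events]
    i = bisect_left(times, t)
    candidates = events[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda e: abs(e[0] - t))


keystrokes = [(1.0, "h"), (1.2, "i"), (1.5, " "),
              (2.4, "y"), (2.6, "o"), (2.7, "u")]
events = trigger_events(keystrokes)
closest = nearest_event(events, 2.0)
```

The event times here are on the playback clock; they would still need
to be mapped back to the original time scale before realignment.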

Effectively the time event tracker module 23 controls speech
rate as a function of typing speed. Further details of one
embodiment of event tracker 23 are found in Appendix II.
Realigner Module 29

Referring to FIG. 3, the realigner module 29 realigns words from
the rough-aligned text stream 27 (output from the Time Event
Tracker Module 23) in order to improve quality of the time marks
generated by the Time Event Tracker 23. Since captions appear on
the screen as a group of words determined by the segmenter
module 33, the realigner module 29 is only interested in
aligning precisely the first and last word of each caption
(group of words per screen). The time indication associated with
the first word determines when the caption should appear, and
the time mark of the last word determines when the caption
should disappear (be removed) from the screen.

The realigner 29 uses a combination of speech recognition and
dynamic programming techniques and receives as input the
original audio track 13 and the roughly aligned text 27 (from
Time Event Tracker 23). The output 31 is a new sequence of
caption text with improved time alignments. Although time
aligned, the stream of output text 31 has no sentence formatting
or punctuation (i.e., no capital first letters of a sentence,
etc.). Restated, the operator transcriber 53 may disregard
punctuation and capitalization. As such the transcribing task is
made simpler and the operator can accomplish the keyboarding of
text from the rate adjusted audio 47 more quickly (in a shorter
amount of time). The resulting output text 31 is thus a sequence
of characters with time stamps indicating time occurrence
relative to the time scale of the original audio 13. Additional
constraints 61, like video cut time marks or additional delay,
may be added to improve readability of the output text 31.

A realigner 29 of the preferred embodiment is described in U.S.
patent application Ser. No. 09/353,729 filed Jul. 14, 1999 by
Assignee of the present invention, entitled "Method for Refining
Time Alignments of Closed Captions" and herein incorporated by
reference.
Closed Caption Segmenter Module 33

The closed caption segmenter module 33 receives as input the
stream 31 of aligned text and the original audio track 13, and
finds appropriate break points (silence, breathing, etc.) to
segment the text into desired closed captions. Thus the
segmenter 33 effectively automates the restructuring and
reformatting of the transcription text into sentences or phrases
appropriate for captioning. The segmenter module 33 preferably
uses three criteria to find these break points: length of
inter-word boundaries; changes in acoustic conditions and
natural language constraints. FIGS. 4A and 4B are illustrative
as described below.

With reference to FIG. 4A, the output 31 of the realigner module
29 (FIG. 3) is time-stamped text. This timing information is
useful to the segmentation process since the length of pauses
between words gives an indication of where sentence breaks might
be. However, the alignment process is not perfect nor are
inter-word pauses necessarily consistent between speakers. Thus
the segmenter 33 additionally uses acoustic and other clues.

Some examples of segmentation schemes based solely on acoustic
information exist in the speech recognition literature. For
example, "Automatic Segmentation, Classification and Clustering
of Broadcast News Audio" by M. A. Siegler et al., Proceedings
DARPA Speech Recognition Workshop, 1997, describes a segmenter
which uses changes in the probability distribution over
successive windows of sound combined with energy thresholds to
generate segment breaks. The combination of this or a similar
scheme with the inter-word pause information lends robustness to
the segmentation process of the present invention.

Additionally, the segmenter 33 uses natural language constraints
63 to verify possible segmentation points. For example, a
segment break is unlikely to occur after a "the" or "a" in a
subset of words. This final piece of information further
increases robustness of the overall system 11.

With reference to FIG. 4B, one embodiment of the segmenter 33
operates as follows.

At a beginning step 101, segmenter 33 receives time aligned
audio 31 and original audio 13. Recall that time aligned audio
31 includes only speech to be transcribed. At step 103 segmenter
33 analyzes time aligned audio 31 and in particular reads time
stamps from one word to another in time aligned audio 31. The
difference between the time stamp at the end of one word and the
time stamp at the beginning of an immediately following word is
the amount of time between the two words. That is, that
difference measures the length of time of the pause between the
two words. If the pause is greater than a predefined suitable
threshold (e.g., one second), then segmenter 33 indicates or
otherwise records this pair of words as defining a possible
break point (between the two words) for captioning purposes.

From the original audio 13, segmenter 33 detects pauses
acoustically at step 105. Where low energy and sound levels span
a longer period of time than any syllable in a neighboring word,
step 105 defines a pause of interest. In particular, segmenter
33 defines such pauses as the end of a sentence.

Of the detected pauses from steps 103 and 105, segmenter 33 may
find common pauses (at the same time marks). These have a
greater possibility, than the other detected pauses, of
indicating the end of a sentence. To further verify validity of
this assumption (that the pause is at the end of a sentence),
segmenter 33 applies (at step 107) natural language rules to the
words surrounding the pause. If the preceding word is an article
such as "a" or "the", then the pause is not at the end of a
sentence (because English sentences do not end with an article).
Other natural language rules are described in, e.g., Analyzing
English Grammar by T. P. Klammer et al., Allyn & Bacon, ed.
1999, herein incorporated by reference.

Step 107 having defined the end of sentences (from the pauses of
steps 103 and 105), segmenter 33 forms groups or units of words
between such pauses/ends of sentences. These word groupings are
effectively sentences and step 109 thus provides punctuation,
capitalization and other sentence formatting and visual
structuring.

The last step 111 provides the formed sentences in segments
appropriate for closed captions according to the time stamps of
time aligned audio 31 and the corresponding time marks of the
playback of original audio 13. That is, as the audio 13 is being
played back, the first word made audible in a given section is
the first word of the closed caption text to be displayed and
the last word followed by a pause in the given audio section
should be the last word of the closed caption text. Step 111
processes each of the formed sentences in this manner to provide
segments appropriate for closed captions. As a result, closed
captions 35 are output from segmenter 33.

According to the foregoing discussion of the present invention,
a semi-automatic system that assists the transcriber in his task
is provided. The invention system helps the transcriber to work
efficiently and automates some of the tasks, like segmentation
of the captions and their precise alignment along time.

The invention system and method is efficient for at least the
following reasons:

1. The system is efficient and comfortable to use; the operator
doesn't have to pause and rewind the recording if it is playing
too fast because the system self controls the rate of speech
playback.
2. The operator can focus on the part of the video/audio track
that really has to be transcribed (spoken words).
3. The manual alignment of closed captions along time is
eliminated by the system (a) capturing time information while
the operator types in the transcription, and (b) realigning
captions as a post-process.
4. Caption segmentation is performed at relevant points in time.

In another example, a speech recognition system may be combined
with the present invention as follows. The input audio track may
be transcribed (at least in part) by a speech-to-text speech
recognition engine. For the portions of the audio track that the
speech recognition engine does not transcribe well or cannot
transcribe, the present invention is utilized. That is, the
input to the present invention system is formed of the
above-noted portions of audio (that the speech recognition
engine does not transcribe) and the operator would focus just on
those portions. The closed captions/text resulting from the
invention system are then used to refine the acoustic models and
dictionary of the speech recognition system. A confidence
measure is utilized in determining which portions of the audio
track the speech recognition engine does not transcribe well.

Accordingly, the present invention creates timing information on
transcription of a given audio track for use in closed
captioning, subtitling and indexing of digitally stored audio
files. In the case of indexing, the given digital audio file may
be treated as a document whose content exists as a stream of
audio. Each word of the document is time marked to indicate the
location of the word in the audio stream or video frame where
the document is a video recording. A useful index to the
document (corresponding audio stream) cross references each word
of the document by speaker, time mark and/or relative location
in a language structure such as sentence or music clip, etc.
With such an index, a search engine is able to search audio
files by speaker, particular word and/or particular sentence (or
other language structure). In a preferred embodiment, the search
engine retrieves an audio clip matching the query parameters and
displays the results. To accomplish this, the retrieved audio
stream is downloaded and processed by a speech recognition
module which produces text from the input audio stream. Where
the speech recognition module employs the present invention
techniques of transcribing audio, the resulting display is
formed of correctly punctuated sentences. Without the present
invention transcription method and apparatus, a speech
recognition module would produce text without punctuation.

In the preferred embodiment, the end user is thus presented with
a visual display of text that corresponds to the retrieved
audio. Upon playback of the audio (in response to the user
selecting or otherwise issuing a "play" command), the sound
system of the user's computer produces the subject audio track
while the screen displays the produced transcription text in
synchronization with the audio. In a preferred embodiment, the
audio that is downloaded and processed by the speech recognition
module is deleted after the transcription process. A pointer to
the server where the audio is stored is embedded in the
displayed results. The embedded pointer is coupled to the "play"
command to effect retrieval and rendering of the audio upon user
command.

These and other uses of the invention exist and are now in the
purview of one skilled in the art having this disclosure before
him.

While this invention has been particularly shown and described
with references to preferred embodiments thereof, it will be
understood by those skilled in the art that various changes in
form and details may be made therein without departing from the
scope of the invention encompassed by the appended claims. For
example, captioning/subtitling systems may also be used wherever
time-aligned transcripts are required. Such transcripts are
useful for multimedia indexing and retrieval similar to that
discussed above. They can be used to permit the user to
precisely locate the parts of the video that are of interest.
Emerging application domains include customized news-on-demand,
distance learning, and indexing legal depositions for assisting
case preparation. Users may access such indexed video and audio
via the Internet, using streaming technologies such as RealVideo
and RealAudio from RealNetworks.

Aligned transcripts may be used to index video material stored
on a digital VCR, either delivered as closed captions integrated
into the signal, or delivered separately via an Internet
connection.

Recently, a synchronized multimedia standard has been developed
by the W3C called SMIL. Among other aspects, it allows streaming
Internet video to be closed-captioned. The SMIL standard is
supported by the RealNetworks G2 format. Users can choose
whether or not to view the captions by selecting an option on
the RealPlayer. So in the future, creating multimedia for the
World Wide Web may typically involve creating captions, as is
currently done for TV.

Finally, the rate-control portion of the invention system
described here would be of value whenever transcripts are
required, whether or not the alignment information is needed.
Appendix I

A Speech Rate Control Method Using Speech Recognition

The following is a method, device and system for controlling the
rate of speech while playing back a recording. Recorded speech
can be difficult to follow or to understand when the speech is
too fast or too slow. For instance, when transcribing an audio
track, an operator may have to often pause, rewind, or
fast-forward the tape to catch all the words he/she has to type
in. These operations are tedious and time consuming.

The method described here allows automatic adjustment of the
rate of spoken words to a comfortable rate for the listener.
Used herein is a speech recognizer to parse the incoming speech,
and the rate of the recognized speech units is used to alter the
speech-playback rate. The playback rate may be adjusted to a
constant target rate or depend on an external synchronization.
Abstract

The disclosed is a method to control the speech-playback rate
based on a count of speech units. A speech recognizer analyzes a
recorded speech stream and produces a count of speech units for
a given unit of time. This data, averaged or windowed over a
larger unit of time to smooth the results, gives an estimate of
the speech rate. A Speech-Playback Rate Adjustment Unit uses the
computed speech rate to control the speech-playback rate to
match a desired rate. The desired speech rate can be a
predefined value or can be controlled by another measured value
such as the rate transcript text is entered by a typist
listening to the speech playback.
Purpose

The disclosed is a method, device and system to automatically
adjust the playback rate of recorded speech to match a target
rate. The target rate may either be a preselected value or may
be computed to synchronize the playback rate to match an
external source such as a text transcription rate.
Background Matter

This subject matter was first discussed by Assignee while
working on multimedia indexing. One way of indexing multimedia
documents is to use the transcription of the spoken content. For
long documents, one would also have to align the text on the
audio track in order to be able to go from the indexed text to
the corresponding piece of audio. Closed captions, if they
exist, give one that information. Recently AltaVista partnered
with Virage and Compaq Cambridge Research Lab to build such a
system. While talking with AltaVista, it was clear that one
needed to find solutions to produce closed captions efficiently
when such captions do not exist.

The first solution that came to mind would be to use speech
recognition to generate the time aligned transcription. Assignee
is currently running experiments to see how good it would be for
indexing. In many cases, one would still need an exact
transcription and even the best speech recognizers are not
accurate enough for that task, especially when the acoustic
conditions vary widely. AltaVista looked at having the closed
captions produced by a third party: the cost is approximately
$1,000 per hour. Considering that this industry is still in its
infancy and the methods used are fairly primitive, Assignee
started to look for semi-automatic solutions for producing
closed captions or more generally time aligned transcriptions
from an audio track.

Another approach is to design a system that would help 
the transcriber in his task. The foregoing disclosure proposes 
a semi-automatic system for producing closed captions. The 
system helps the operator to work efficiently and automates 
some of the tasks, like segmentation of the captions, and 
their precise alignment along time. FIG. 1 shows the differ- 
ent modules of such a system, as described previously. 

By way of review, the first module, the audio classifier, 
sorts out the input audio into different categories: spoken 
text, music, etc. Interest here lies in the spoken parts because 
they need to be transcribed, but also possibly in particular 
noises or sounds that may need to be noted in the captions. 
Only the spoken parts are sent to the next module. 

The next module, the speech rate-control module, controls the
rate of speech depending on how fast the text is spoken and/or
how fast the operator types. This module ensures that the spoken
text remains understandable by maintaining a constant pitch. The
audio produced is time-stamped since a time dependent
transformation has been applied to the audio samples. Time
stamps will allow the next module to use the proper time scale.
The following discussion improves the functionality of this
module. Current dictation and closed captioning systems cannot
alter the rate of speech playback. If the speech rate is faster
than the rate the transcriber can type, then he will have to
frequently pause the playback in order to catch up with the
speech. If the speech rate is slower than the transcriber's
typing rate, there is no mechanism to make the speech playback
faster. The following discussion describes a method of adjusting
the rate of speech playback automatically through the use of
speech recognition techniques.

The third module, the time event tracker, records the time 
the words were typed in by the operator. This provides a 
rough time alignment that will be precisely realigned by the 
next module. The recorded time events are mapped back to 
the original time scale. 

The fourth module re-aligns precisely the text on the 
audio track by using speech recognition techniques at the 
word level. 

Finally, the closed caption segmenter breaks the text into 
captions, similar to a sentence for written text, based on 
acoustic clues. 
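
The time-stamp part of the segmentation described earlier in this
disclosure (a pause longer than a threshold such as one second marks a
possible break, unless the preceding word is an article) can be
sketched as below. This Python sketch is illustrative only: the word
tuple format and function names are assumptions, and the acoustic
(energy-based) pause detection is deliberately left out.

```python
ARTICLES = {"a", "an", "the"}


def break_points(words, min_pause=1.0):
    """Return indices i such that a caption break may follow words[i].

    words: list of (text, start_seconds, end_seconds) in spoken order.
    """
    breaks = []
    for i in range(len(words) - 1):
        text, _, end = words[i]
        next_start = words[i + 1][1]
        pause = next_start - end
        # A long pause after an article is not treated as a sentence end.
        if pause > min_pause and text.lower() not in ARTICLES:
            breaks.append(i)
    return breaks


def captions(words, min_pause=1.0):
    """Group words into (appear_time, disappear_time, text) captions."""
    cut = set(break_points(words, min_pause))
    groups, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i in cut or i == len(words) - 1:
            text = " ".join(w[0] for w in current)
            groups.append((current[0][1], current[-1][2], text))
            current = []
    return groups


words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("the", 2.5, 2.6),
         ("dog", 4.0, 4.3), ("barked", 4.4, 4.9)]
result = captions(words)
```

In this example the 1.4-second pause after "the" is ignored because of
the article rule, so only one break is produced.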
The How's

FIG. 2 shows a block diagram of the Speech Rate Calculation
Unit. The Speech Rate Calculation Unit analyzes the recorded
speech data and calculates the average speech rate. This unit
may operate in real time, or the averaged instantaneous rate
values may be computed ahead of time during a preprocessing
step. The Target Speech Rate Calculation Unit either provides a
predetermined value or computes the rate from an external
source, for example the transcriber's text entry rate. The
Speech-Playback Rate Adjustment Unit compares the speech rate of
the recorded speech with the target rate. If the difference in
rates exceeds a predefined threshold, then the unit speeds up or
slows down the playback of the speech output to reduce the
difference in rates.
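
A minimal sketch of this compare-and-adjust behavior is given below in
Python. The step size, clamping limits, threshold value and the
function name `adjust_playback` are all assumptions made for the
example; the disclosure does not specify them.

```python
def adjust_playback(factor, recorded_rate, target_rate,
                    threshold=1.0, step=0.05, lo=0.5, hi=2.0):
    """Return an updated playback speed factor.

    factor:        current playback speed (1.0 = normal speed)
    recorded_rate: speech rate of the recording at normal speed (units/s)
    target_rate:   desired speech rate heard by the listener (units/s)
    """
    played_rate = factor * recorded_rate
    diff = played_rate - target_rate
    if abs(diff) <= threshold:
        return factor  # within tolerance; leave playback unchanged
    if diff > 0:
        factor -= step  # speech is too fast for the target: slow down
    else:
        factor += step  # speech is too slow for the target: speed up
    return min(max(factor, lo), hi)  # keep the factor in a usable range


slower = adjust_playback(1.0, 15.0, 10.0)
unchanged = adjust_playback(1.0, 10.5, 10.0)
```

Calling this once per rate estimate moves the playback speed toward
the target in small steps rather than jumping, which is one simple way
to avoid abrupt audible changes.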
Speech Rate Calculation Unit 41 

The speech rate calculation unit 41 uses a speech recognition
system to compute the rate of speech (ROS). The speech
recognition system analyzes the incoming audio and produces the
most likely sequence of linguistic units that matches what has
been said. Linguistic units could be sentences, words or
phonemes. Phonemes represent the smallest possible linguistic
units. The speech recognition system outputs sequences of these
linguistic units together with their time alignments, i.e., when
they occur along the stream of speech.

The recognition system may operate in one of two different
modes, batch or streaming. In batch mode, recognized linguistic
units are returned once a whole chunk of audio has been
processed. For instance, a batch recognition system can compute
and return the best hypothesis for each linguistic unit once
every few seconds. This approach allows running several passes
on data and can therefore increase recognition accuracy. In the
streaming mode, linguistic units are returned as soon as they
are recognized. This approach allows running real-time
applications like dictation. A method to implement a streaming
version of a phonetic recognizer with low latency is described
in U.S. application Ser. No. 08/977,962 filed Nov. 25, 1997 by
Jean-Manuel Van Thong and Oren Glickman and entitled "Real Time
Method for Streaming Phonemes During Speech Decoding" and herein
incorporated by reference. A low-latency streaming phoneme
recognizer allows the system to start playback with a short
delay and without preprocessing. A batch system could be used,
but the user would have to preprocess the speech data, or wait
until at least a chunk of audio is processed before starting
playback.

The sequence of phonemes generated by the recognition system can
be used to compute the rate of speech. The rate of speech varies
globally between speakers, and locally for the same speaker, due
to emotion, stress, emphasis, etc. "On the Effects of Speech
Rate in Large Vocabulary Speech Recognition Systems" by Matthew
Siegler and Richard Stern, in Proceedings, ICASSP, May 1995,
shows that phoneme recognition can be used to give a good
measure of rate of speech.

In the preferred embodiment of the Speech Rate Calculation Unit
41, phoneme recognition is used instead of word or sentence
recognition for several reasons. First, small linguistic units
give a better measure of the rate of speech because there is a
much higher variation in the structure and length of words. See
"Extracting Speech-Rate Values from a Real-Speech Database", by
W. N. Campbell in Proceedings, ICASSP, April 1988. Also, the
phonetic rate is somewhat independent of the word length, and
thus the number of short words versus long words will not have a
major impact on the measure of speech rate. Second, the size of
the input dictionary is limited to the number of phonemes,
typically between 40 and 50 depending on the implementation, and
therefore the system is inherently less complex than systems
employing a larger linguistic unit. With phoneme recognition, a
real-time, streaming system is possible, without a need for a
preprocessing step. Third, using phonemes allows faster
reactions to variations in speech rate. Last, using phoneme
recognition makes the system almost language independent. Most
European languages use roughly the same phoneme set.

The phoneme recognizer can readily detect pauses 
between words and long silences or other non-speech inter- 
ludes. The system can thus be optionally set to automatically 
skip or speed playback over long pauses or sections of 
non-speech audio such as music. To prevent the computed
speech rate from dropping during periods of non-speech 
playback (because no phonemes are present), these periods 
should not be factored into the speech rate calculations. 

The measured rate of speech (ROS) can be averaged in order
to smooth its local variations. The average phonetic
rate can be computed in several ways. For example, one can
count the number of phonemes within a time window of a
few seconds. The time window is a FIFO (First In, First Out)
queue of phonemes and corresponds to a fixed interval of
time. Recognized phonemes are time marked with their start
and end time, usually measured in number of audio samples,
so one knows the relative position of phonemes within a
given time window. The number of phonemes within the
window, divided by the length of the time window, gives the
phonetic rate.
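
The windowed count described above can be sketched as follows.
This is a minimal illustration only, not the implementation of the
Speech Rate Calculation Unit 41: the class name, the 3-second
window and the 16 kHz sample rate are assumptions chosen for the
example, and phonemes are assumed to arrive in time order with
start and end marks measured in audio samples.

```python
from collections import deque


class PhoneticRateWindow:
    """Average phonetic rate over a FIFO window of fixed duration.

    Hypothetical helper: window length and sample rate are
    illustrative defaults, not values taken from the disclosure.
    """

    def __init__(self, window_seconds=3.0, sample_rate=16000):
        self.sample_rate = sample_rate
        self.window_samples = int(window_seconds * sample_rate)
        self.queue = deque()  # (start_sample, end_sample) per phoneme

    def add_phoneme(self, start_sample, end_sample):
        """Queue a recognized phoneme and drop those outside the window."""
        self.queue.append((start_sample, end_sample))
        window_start = end_sample - self.window_samples
        # First in, first out: the oldest phonemes leave the window first.
        while self.queue and self.queue[0][1] <= window_start:
            self.queue.popleft()

    def rate(self):
        """Phonemes per second: count in window / window length.

        Until the first full window has elapsed this underestimates
        the rate, because the window is only partly filled.
        """
        return len(self.queue) / (self.window_samples / self.sample_rate)
```

Feeding only the phonemes recognized during speech (and pausing
the sample clock during skipped non-speech sections) keeps silent
stretches from pulling the computed rate down.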
Target Rate Calculation Unit 45 

The target rate can be set to a pre-defined value or may 
depend upon an external synchronization source. 

If the output speech rate is set to a constant value, the
Speech-Playback Rate Adjustment Unit generates speech
output with a phonetic rate that matches that value on
average.

Alternatively, the speech-playback rate may depend on an
external synchronization source such as the text-input rate of
an operator transcribing the recorded speech. In this case,
input text is time marked and parsed into phonemes. This
gives the typed phonetic rate. This rate is averaged like the
spoken phonetic rate. This rate can then be used as the target
rate input to the Speech-Playback Rate Adjustment Unit 43.
With this approach, there should be a mechanism to flag
certain typed words that do not directly correspond to
uttered speech, for example, speaker identifications, scene
descriptions or typographical corrections.
Speech-Playback Rate Adjustment Unit 43 

The Speech-Playback Rate Adjustment Unit 43 plays
back recorded speech at a rate based on the values provided
by the Speech Rate Calculation Unit 41 and the Target Rate
Calculation Unit 45. If the calculated speech rate is greater
than the target rate, the speech-playback rate is decreased. If
the calculated speech rate is less than the target rate, the
playback rate is increased.

There are many ways to determine the amount of change
in the playback rate. One possible method is to increase or
decrease the playback rate by a fixed, small increment every
time there is a mismatch in the two rates. This method
produces a slow, gradual change in the speech rate. In the
other extreme, the rate can be changed so that it matches the
target rate instantaneously. This will likely cause abrupt
changes in the speech rate. The system can reduce the
severity of the rate change by scaling the difference in rates
by a constant value less than one. Yet another way is to
change the rate based on a non-linear function of the
difference in rates, wherein larger differences will cause a
disproportionately large change in the rate. For example, the
rate can be changed only if the difference exceeds a certain
threshold.
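
The update strategies listed above can be compared in a short
sketch. It is an illustration under stated assumptions rather than
the method of the Speech-Playback Rate Adjustment Unit 43: the
function name, the parameter defaults and the representation of
the playback rate as a multiplier (1.0 meaning original speed) are
hypothetical, and the output phonetic rate is assumed to scale
linearly with that multiplier.

```python
def adjust_playback_rate(playback_rate, speech_rate, target_rate,
                         method="scaled", step=0.02, gain=0.5,
                         threshold=1.0):
    """Return a new playback-rate multiplier.

    speech_rate and target_rate are phonetic rates (phonemes/second)
    of the speech as currently played back. All defaults are
    illustrative assumptions.
    """
    if speech_rate <= 0:
        return playback_rate  # no speech measured: leave rate alone
    diff = speech_rate - target_rate
    if diff == 0:
        return playback_rate
    if method == "fixed":
        # Small fixed increment: slow, gradual convergence.
        return playback_rate - step if diff > 0 else playback_rate + step
    if method == "instant":
        # Match the target at once: likely abrupt changes.
        return playback_rate * target_rate / speech_rate
    if method == "scaled":
        # Remove only a fraction (gain < 1) of the difference.
        desired = speech_rate - gain * diff
        return playback_rate * desired / speech_rate
    if method == "threshold":
        # Non-linear rule: ignore small differences entirely.
        if abs(diff) <= threshold:
            return playback_rate
        return playback_rate * target_rate / speech_rate
    raise ValueError("unknown method: " + method)
```

For instance, with a measured rate of 12 phonemes per second and a
target of 10, the "scaled" rule with a gain of 0.5 aims for 11
phonemes per second on this step instead of jumping straight to 10.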

Once the desired change in rate is known, there are
numerous ways to alter the playback rate of speech. The
simplest method, direct change of the playback rate (e.g.,
playing back an analog tape at a faster or slower speed or
playing back digitized speech at a different rate than the
original sampling rate), will change the pitch of the voice.
Several time-domain, interval-sampling techniques enable a
change in speech-playback rate without a change in pitch
("The intelligibility of interrupted speech" by G. A. Miller
and J. C. R. Licklider in Journal of the Acoustic Society of
America, 22(2):167-173, 1950; "Note on pitch-synchronous
processing of speech" by E. E. David and H. S. McDonald
in Journal of the Acoustic Society of America, 28(7):
1261-1266, 1965; "Speech compression by computer" by S.
U. H. Quereshi in Time-Compressed Speech, pp. 618-623, S.
Duker, ed., Scarecrow, 1974; "Simple pitch-dependent
algorithm for high quality speech rate changing" by E. P.
Neuberg, Journal of the Acoustic Society of America, 63(2):
624-625, 1978; "High quality time-scale modification for
speech" by S. Roucos and A. M. Wilgus, Proceedings of the
International Conference on Acoustics, Speech, and Signal
Processing, pp. 493-496, IEEE, 1985; and "Real-time
time-scale modification of speech via the synchronized
overlap-add algorithm" by D. J. Hejna, Jr., Master's thesis,
Massachusetts Institute of Technology, February 1990,
Department of Electrical Engineering and Computer Science).
These techniques are generally low in complexity (real-time
software implementations are possible), but some of the
simpler methods will result in audible artifacts which may
hurt intelligibility. Several frequency-domain methods also
enable a change in playback rate without a change in pitch
("Time-domain algorithms for harmonic bandwidth reduction
and time scaling of speech signals" by D. Malah, IEEE
Transactions on Acoustics, Speech, and Signal Processing,
ASSP-27(2):121-133, April 1979; "Time-scale modification
of speech based on short-time Fourier analysis" by M. R.
Portnoff, IEEE Transactions on Acoustics, Speech, and
Signal Processing, ASSP-29(3):374-390, June 1981; and
"The phase vocoder: A tutorial" by M. Dolson, Computer
Music Journal, 10(4):14-27, 1986). These methods are
generally more complex than the time-domain techniques,
but result in higher quality speech. Another indirect way of
altering the overall playback rate is to abbreviate or extend
the length of pauses in the speech based on silence detection
("TASI quality - Effect of Speech Detectors and
Interpolators" by H. Miedema and M. G. Schachtman, in The
Bell System Technical Journal, pp. 1455-1473, 1962). This
approach has the advantages of low complexity and possibly
more natural sounding rate-adjusted speech by keeping the
uttered speech rate consistent with the original recording.

The method described here is not limited to transcriptions 
of recorded speech. Since the phoneme recognition can be 
implemented in a streaming manner, transcription can be 
done on-line with only a slight delay from 'live'. The system
could keep up with the rate of slowed-down speech by short-
ening non-speech segments, and buffering data up-front. 
Today, on-line transcribers start to transcribe broadcasts in 
advance by looking ahead during commercials. 

This method is useful and efficient for the following 
reasons: 

1. Once embedded in a transcription system, the method 
relieves the user of having to manually control the 
playback rate whenever he encounters a change in the 
transcription load. 




2. The system adjusts the speech rate based on acoustic 
information, and is independent of who is speaking, the 
rate at which he speaks, and ultimately the spoken 
language. The system makes the playback of fast and 
slow talkers more uniform. 

3. When the target rate is calculated based on the typing 
rate, the system automatically adjusts the speech rate to 
match the typing rate of the transcriber, without
requiring his intervention.

4. The target rate can be selected based on a more intuitive 
value, such as words per minute. 

5. The speech rate adjusted to the calculated or pre- 
defined rate is intelligible. 

Appendix II 
Time-Event Tracker Method and System 

The disclosed is a method and system to automatically
link user input with a playback or recording of streaming
data through time-correlated, event-generated data pointers.
The disclosed can be used to link meeting minutes with an
audio recording of the meeting. A closed captioning system
could also use this method to get a rough alignment between
the transcript and the audio or video recording.
Abstract 

The disclosed is a method and system to automatically
link user input with a pre-recorded playback or live
recording of streaming data via time-stamped event pointers.
The system automatically detects pre-defined trigger events,
pairs an event label with a pointer to the data that caused the
event, then time stamps and records the data in a master file.
User input is thus linked to the streaming data by the
nearest-in-time trigger event recorded for the streaming
data.

Purpose 

The disclosed is a method and system to automatically 
link user input with a playback or recording of streaming 
data through time-correlated, event-generated data pointers. 
There are many practical applications of such a system:

The disclosed links meeting minutes with an audio 
recording of the meeting. 

The closed captioning system discussed above uses this 
method to get a rough alignment between the transcript 
and the audio or video recording. 

The disclosed could be incorporated in a video annotation 
system. 
Background Matter 

Finding a desired section within a recording of streaming 
data (e.g., an audio or video tape recording) is typically 
difficult even for the person who made the recording. 
Without some additional information, it is impossible for 
someone who has never witnessed the recorded data to find 
a desired section short of playing back the data from the 
start. Often it is desirable to correlate the contents of an
audio or video recording with additional user data, such as
notes or annotations. For example, an audio or video
recording of a presentation could be correlated with a set of notes
taken during the course of the talk. 

Systems for event logging provide a basic correlation 
capability. The UNIX utility syslogd provides a facility for 
time-stamped logging of user-defined events. Events generated
by multiple sources are logged into a central file along
with their time-stamps. Events can be correlated by com- 
paring time-stamp values. Similarly, in an X windows 
system, user events can be logged along with their time- 
stamps as a record of the user's activities. 

There is a variety of prior art describing event logging
systems and their applications. The patent "Arrangement for
Recording and Indexing Information" (U.S. Pat. No.
4,841,387 to Rindfuss) addresses the specific problem of
correlating hand-written notes to a reel-type audio recording.
The invention is a device that automatically correlates the
position of the hand-written notes on a page with the position
of the audio tape on the reel. The patent "Interactive System
for Producing, Storing and Retrieving Information Correlated
with a Recording of an Event" (U.S. Pat. No. 5,564,005 to
Weber et al.) is a more general approach to linking
information with a recording. However, that system forces the
user to explicitly request a "time zone" event in order to
correlate subsequent user input to a given section of
recorded data. The patent "Method and Apparatus for
Recording and Analyzing an Interaction Log" (U.S. Pat. No.
5,793,948 to Asahi et al.) describes a system with an event
detecting means that triggers a state detecting means
wherein the state detecting means detects at least two
parameters, position and time, of objects within a display
window. These parameters are recorded and linked with the
triggering event, but not with another synchronous data
stream. The patent "Computer Graphics Data Recording and
Playback System with a VCR-based Graphic User Interface"
(U.S. Pat. No. 5,748,499 to Trueblood) is limited to
recording of an X-windows session wherein X-windows
commands, states, and events are linked via time-stamps.
Similarly, the patent "Computerized Court Reporting
System" (U.S. Pat. No. 4,924,387 to Jeppesen) is limited to
synchronizing stenographic annotations with a video
recording of a trial.

Event logging methods assume that a data-pair consisting
of an event label and a time-stamp is sufficient to determine 
the relationship between time sequences of actions. This is 
often the case for computer system events or X window 
events. 

However, this approach is problematic for streaming
media data. An event logging approach is possible if one
assumes that streaming data is produced or recorded at a
known rate. Given the data rate and a particular time stamp
it is possible to determine a position within the data stream.
However, in many important applications the streaming rate
can vary asynchronously, as when a media source such as a
tape player is paused, reversed, or advanced frame-by-frame
using a thumbwheel.

An alternative to correlating time-stamped events with a
continuous data stream is to break a data stream up into
chunks, each associated with a particular event. An example
of this approach is the "record narration" feature of
Microsoft Powerpoint97. Each slide in a presentation can be
linked to a separate audio file which will be played whenever
that slide is rendered.

In cases where a multimedia data stream contains an
audio channel, speech technology can be used to correlate
the data stream with a text transcript. By performing speech
recognition on the audio channel it may be possible to
correlate words in a transcript with positions in the data
stream. However, this approach is limited to data streams
with speech audio on which speech analysis is successful.
The How's 

The disclosed is a time event tracker system 23 for
automatically correlating one or more unsynchronized
streams of data by means of predefined, time-stamped
events with associated data pointers. FIG. 5 shows a block
diagram of the system 23. The system 23 consists of a
plurality of Event-Triggered Data Pointer Generators 75a
and 75b and an Event Pointer Recording Unit 81. Each
generator unit 75 is designed to detect a set of user-defined
events and communicate the occurrence of each event along
with an associated data pointer. Each generator 75 produces
a sequence of pairs of the form (E_i, P_i), where E_i is an
event label, P_i is a data pointer and i denotes the position
in the sequence.

The dotted lines in FIG. 5 illustrate the use of the system
23 for on-line audio transcription. In this application, a user
listens to an audio signal (from 83) and transcribes it into
text 79 in real time. The user has a means to slow down or
speed up the rate of audio playback as needed to aid in
transcription. This is described in Appendix I. Note that the
playback device 83 could be an external analog tape deck
which is physically distinct from the computer system on
which the transcription is being entered. There are two
generator units 75a and 75b. They take their inputs from the
word processing software 85 on which the transcription is
entered and from the external tape deck 83 (through an
appropriate hardware interface).

The trigger events for transcription 91, labeled e_i in FIG.
5, are generated whenever the user types a new word in the
word processing application 85. There are three types of
transcription trigger events, called "append", "insert" and
"delete". "Append" events are generated whenever a word is
added to the end of the current transcript. "Insert" and
"delete" events are generated whenever the user backs up
inside previously typed text and inserts or deletes a word.
The associated data pointers, labeled p_i in FIG. 5, give the
absolute position of each word within the document 79. In
the preferred embodiment, this is a word identifier which is
incremented in units of 100 to accommodate insertions and
deletions. A typical sequence of event-pointer pairs on the
transcription side 91 might be: (append, 300), (append, 400),
(insert, 350), (append, 500), etc. The transcript-side data
pointer generator unit 75a can easily produce word
identifiers for append events by incrementing the most recent
identifier by 100. Identifiers for insert events can be
produced by computing the average of the identifiers for the
two words which border the insertion.

The playback-side pointer generator 75b takes its input
from the analog tape deck 83. Playback trigger events,
labeled f_i in FIG. 5, are produced whenever there is a
change in the playback rate of the audio data. In a simple
instantiation, there would be five trigger events f_i
corresponding to the functions play, stop, fast-forward, rewind
and pause. The data pointers are labeled q_i in FIG. 5. They
identify the absolute position in the audio source, called the
audio position, at which a trigger event occurred. In a simple
instantiation, a pointer could be a burned-in time-code on the
tape which is read by a high-end deck such as a Betacam. A
typical sequence of event-pointer pairs on the playback-side
77 might be: (play, T_0), (rewind, T_1), (play, T_2),
(fast-forward, T_3), etc. where the T_i's are time-codes.

The Event Pointer Recording Unit 81 assigns a universal
time-stamp to each event-pointer pair, resulting in a triple of
the form (E_i, P_i, t_i) where t_i denotes the time-stamp for
the ith triple. The triples are recorded in the Time-stamped
Event Pointer Index 87, where they can be ordered on the
basis of their time-stamp value. In the example shown in FIG.
5, two sequences of triples of the form (e_0, p_0, t_0),
(e_1, p_1, t_1), . . . and (f_0, q_0, t'_0), . . . are logged in
the index. In this example, it is assumed that each time-stamp
t'_i is matched to time-stamp t_i. If each stream of event
pairs is recorded separately, the matching process may be
accomplished with a simple merge sort. The time-stamp
matching process serves to correlate the unsynchronized data
streams represented by the sequences of pointers p_i and q_i.
In the transcription application, words in the transcript 79
are matched to positions on the audio tape 89.

An important point is that in many applications such as
transcription, preprocessing of the index is necessary before
the time-stamp matching process can be performed. There
are two categories of preprocessing: event filtering and data
pointer interpolation. Event filtering is necessary whenever
insertions or deletions are being made into a data stream.
This occurs in the transcription module side 91 of FIG. 5,
since words are inserted into and deleted from the transcript
79 in real time as the user corrects mistakes or omissions.
Deletion events are filtered by matching their data pointer to
the data pointer of a previous append event, and then
removing both events from the index 87. Likewise, the data
pointers associated with insertion events are assigned a
time-stamp which is consistent with their position in the
stream. This is done by changing the time-stamp for insertion
events to the average of the time-stamp values for the data
pointers that border the insertion. As an example consider the
following sequence of triples produced by the Event Pointer
Recording Unit 81 from the output of the transcript-side data
pointer generator 75a and stored in the index 87: (append,
500, 20), (append, 600, 25), (append, 700, 35), (delete, 500,
40), (insert, 650, 50). After filtering, this sequence becomes:
(append, 600, 25), (insert, 650, 30), (append, 700, 35).

The second preprocessing operation is data pointer
interpolation. This is performed in order to generate a regular
stream of data pointers from a sparse set of trigger events.
This is necessary in the playback module/side 77 of the
transcription system, since the goal is to correlate the
transcript 79 with the audio stream 89 at a much finer
temporal scale than that of the tape deck trigger events. Each
event such as play or fast-forward changes the playback
speed of the tape. Thus this speed is known between events.
Given the time-stamps for two events, the elapsed time that
separates them can be divided into regular intervals and
assigned additional time-stamps. Given the data pointers at
the event occurrence and the playback speed, it is simple to
interpolate the pointer values at the boundary across the time
intervals. The result is a dense set of data pointers at regular
intervals. The size of the sampling intervals is chosen
depending on the desired resolution in correlating the text 79
to the audio 89.

Data pointers are necessary in dealing with a data stream,
such as words in a transcript, that is not contiguous with
real-time. The presence of insertions and deletions removes
any simple correlation between stream position and
real-time. Correlation can be restored through post-hoc
analysis using the stored data pointers.

Data pointers are also useful in providing robustness to
system latency. This can be illustrated using the example of
audio transcription. Suppose the task is to correlate user
input with an audio stream. The Event-Triggered Data
Pointer Generator 75b for the audio playback device 83
produces a sequence of pairs consisting of a playback event
label (play, fast forward, rewind, stop, pause) and its
associated data pointer (e.g. the audio position). These data
provide a complete record to the point in the audio signal
where the event occurred. Latency at the Event Pointer
Recording Unit 81 will only affect the accuracy of the
temporal correlation between the user input events and the
audio playback events. It will not affect the positional
accuracy of the data pointer. In contrast, an event logging
system would simply time-stamp the event labels. In these
systems, audio positions must be deduced from the
time-stamp and the audio playback rate associated with each
event. Thus any difference between the recorded time-stamp
and the actual time at which an event occurred will produce
errors in the audio position. These errors will be further
compounded by latencies in logging user events.
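
The event filtering step for transcript-side triples can be
sketched as below, using the (label, pointer, time-stamp) triples
from the worked example in this section. This is an illustrative
sketch, not the recording unit's actual code: the function name is
hypothetical, and it assumes that each insertion is bordered on
both sides by entries whose time-stamps are already consistent
(for example append events).

```python
def filter_events(triples):
    """Filter transcript-side (label, pointer, time_stamp) triples.

    Deletions remove the earlier event carrying the same data
    pointer and are themselves dropped; insertions receive the
    average time-stamp of the two entries that border them.
    """
    kept = {}  # data pointer -> surviving triple
    for label, pointer, stamp in triples:
        if label == "delete":
            kept.pop(pointer, None)  # drop the matching earlier event
        else:
            kept[pointer] = (label, pointer, stamp)

    # Order by data pointer, i.e. by position in the transcript.
    ordered = [kept[p] for p in sorted(kept)]
    result = []
    for i, (label, pointer, stamp) in enumerate(ordered):
        if label == "insert" and 0 < i < len(ordered) - 1:
            # Average of the bordering time-stamps (assumed reliable).
            stamp = (ordered[i - 1][2] + ordered[i + 1][2]) / 2
        result.append((label, pointer, stamp))
    return result


index = [("append", 500, 20), ("append", 600, 25), ("append", 700, 35),
         ("delete", 500, 40), ("insert", 650, 50)]
print(filter_events(index))
# [('append', 600, 25), ('insert', 650, 30.0), ('append', 700, 35)]
```

The printed result matches the filtered sequence given in the
text: the deleted word 500 disappears, and the inserted word 650
is re-stamped to 30, midway between its neighbors at 25 and 35.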
The ordered index 87 that results from time-stamp matching
serves to cross-link multiple data streams, permitting
several different kinds of access. For example, given a
desired time, the closest time-stamp for each data stream can
be identified. Using the associated data pointers, the
corresponding data stream positions associated with that time
can be identified. In the above example, suppose that t_i and
t'_j were the closest time-stamp matches to a given query time
t. Then the pointers p_i and q_j would give the data stream
positions associated with time t.

Alternatively, given a position in a particular data stream,
the closest data pointer stored in the index can be identified.
The time-stamp for that entry can then be compared to the
time-stamps for the other data streams, resulting in a set of
corresponding entries in the index. The data pointers for
these entries can be followed, giving the corresponding
positions in the other data streams. For example, suppose
that p_0 was the closest data pointer to a given query position.
Then since t_0 is matched to t'_0 (after appropriate event
filtering and data pointer interpolation), it follows that q_0 is
the position in the second data stream that corresponds to the
query position. This could be used in the transcription
example to synchronize the display of the transcript 79 with
the playback of the audio 89. It could also be used for
multimedia indexing. In that case, text queries would
produce matches in the transcript text 79 which could then be
linked directly to segments of audio 89 content.

Furthermore the recorded content can be modified in a
post-processing phase to incorporate the cross-links
produced by time-stamp matching. In the transcription system,
for example, words in the transcript 79 could be
automatically hyperlinked to spoken words in the recorded
data 89, using the capability for indexing into a data stream
provided by formats like RealAudio. Or frames in a video
sequence could be automatically labeled with text from a
correlated transcript.

It should be clear that the time-event tracking system 23
could be adapted to many different scenarios. For example,
by changing the time interval used in pointer interpolation
one can address both closed captioning applications where
many words map to a single video frame, and transcription
applications where words map to audio samples at a much
finer level of granularity. FIG. 6 shows the architecture for
a presentation-recording system 95, which is a second
application that is enabled by the disclosed time-event
tracker module of FIG. 5. In the presentation recording
system 95, a multiplicity of users watching a presentation
take notes using an application such as a word processor. In
a preferred embodiment, each note-taker is using a portable
computer 97a, b, such as a laptop, with a wireless network
connection 99. Each portable 97 is running a subset of the
architecture in FIG. 5, consisting of a data pointer generator
75 for the word processor and a recording unit 81. In this
case, each recording unit is only logging one stream of
event-pointers. All of the recording units 81 are
synchronized via the wireless network 99 so their time-stamps
will be consistent. Here text events might be generated at the
level of paragraphs or sentences rather than words as in the
transcription example. Note that the issue of handling
insertions and deletions remains.

The presentation system 95 includes a second instantiation
of the architecture from FIG. 5, which runs on a meeting
server 59. It has two generator modules 75c, d, which
service a slide presentation application 55, such as MS
PowerPoint, and an audio-visual capture device 57. The
event triggers for the PowerPoint application 55 would be
either a slide change or a within-slide animation. The data
pointers are the slide number or an identifier for the
within-slide animation. The presentation system 95 makes it
possible to correlate notes against the particular slide or
animation displayed by the presenter. Note that out-of-order
access to slides is also present in this application, analogous
to out-of-order generation of text entries in the note-taking
module.

Events for the audio-video capture device 57 could be
very simple, possibly just indicating the beginning and end
of the presentation. Implicitly generated events and data
pointers would provide pointers into the recorded stream at
regular, pre-defined intervals. The recording unit 81 for the
meeting server 59 is synchronized with the individual
recording units 81 on each user's portable computer 97, to
ensure consistent time-stamps.

After the talk is over, each user would have the option of
correlating their notes against both the audio-visual
recording and the time-stamped pointers into the presentation
slides. This could be done via a simple web interface. It
could be performed by doing a merge sort between the user's
event-pointer index 87 and the one produced on the meeting
server 59. The same web interface could then allow each
user to retrieve a presentation slide and video clip associated
with each sentence or paragraph in their notes. In addition,
the user could replay the presentation recording, with both
the PowerPoint slides 55 and their recorded notes scrolling
in synchrony with the audio-visual 57 playback.

In a further embodiment, the notes from all of the users in
attendance could be pooled into one larger time-matched
index, by simply merge sorting all of the separate index files
87. The result could be easily cross-linked against the
presentation slides 55 and audio-video recording 57 as
described above. A searchable text index of the pooled notes
could then be constructed, using AltaVista search
technology, for example. This would allow people who had
not attended the talk to search for keywords in the pooled
notes. Each hit would have a set of cross-linked data pointers
that would make it possible to retrieve the slides and video
clips of the presentation associated with the keywords. The
retrieval quality of this index could be further improved by
augmenting the index with text from the slides 55
themselves, as well as transcripts of the presentation obtained
through speech recognition.

Because the system 95 explicitly stores data pointers, one
can cross-link data streams which do not share a common
clock, or even have a consistent real-time clock. For
example, if one stream consists of video or audio data it is
not necessary that the other streams be expressible in terms
of video frame rate or audio sample rate. Furthermore, the
system 95 supports data streams in which the data rate can
change dramatically in an asynchronous or unpredictable
fashion.
Event-Triggered Data Pointer Generators 75

The Event-Triggered Data Pointer Generators 75 detect
trigger events from, for example, a user input application 85
or a streaming data recording or playback device 83. The
trigger events are prespecified and can be explicit or
implicit. Explicit trigger events include, for example, an
auto-save event in a word processing application 85, or a
'stop', 'play', 'record', or 'pause' command on a streaming
data recording/playback device 83.

User input data is often created using a PC application
such as a word processor 85. To get more resolution for the
user input data, the Event-Triggered Data Pointer Generator
75a (FIG. 5) may incorporate a gateway application to
intercept keyboard and mouse inputs before they are passed
to the user input application 85. Most operating systems
enable a keyboard intercept functionality. For example, the
Windows 95 operating system has the
VXD_Filter_Keyboard_Input keyboard input service to allow a
program to access all keyboard input before it is processed.
This approach would allow trigger events to resolve down to
the individual keystrokes.

Because of the nature of the streaming data recording/
playback devices 83, only the 'play', 'record', 'fast
forward', 'rewind', 'pause', and 'stop' events need be
explicitly communicated to the Event-Triggered Data
Pointer Generators 75b. Given a fixed playback or record
data rate, one embodiment of the Event-Triggered Data
Pointer Generator 75b could automatically generate data
trigger events based on elapsed time; thus, implicit trigger
events occur for every predetermined, fixed interval of time.
In this embodiment, the Event-Triggered Data Pointer
Generators 75b also must implicitly generate the data pointers,
resulting in a possible reduction of accuracy. To improve the
resolution of links to the streaming data events, the
Event-Triggered Pointer Generators 75b for such data typically
will trigger at a much higher rate than the Event-Triggered
Data Pointer Generators 75a monitoring the user input
application.
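
Generating implicit, fixed-interval data pointers between explicit
playback events can be sketched as follows. The sketch is
illustrative only: the function name, the triple layout (label,
audio position in seconds, time-stamp in seconds) and in
particular the speed assigned to each event label are assumptions
made for the example; a real deck would report or define its own
transport speeds.

```python
import math

# Hypothetical playback speeds (audio seconds per wall-clock second).
SPEED = {"play": 1.0, "fast-forward": 4.0, "rewind": -4.0,
         "stop": 0.0, "pause": 0.0}


def interpolate_pointers(events, interval):
    """Turn sparse playback events into dense (time_stamp, position) pairs.

    events: list of (label, audio_position, time_stamp), sorted by
    time-stamp. Between two events the speed is constant, so the
    audio position at each intermediate time-stamp follows from the
    position and speed at the earlier event.
    """
    dense = []
    for (label, position, start), (_, _, end) in zip(events, events[1:]):
        speed = SPEED[label]
        steps = math.ceil((end - start) / interval)
        for k in range(steps):
            elapsed = k * interval
            dense.append((start + elapsed, position + speed * elapsed))
    if events:
        _, position, start = events[-1]
        dense.append((start, position))  # last explicit event as logged
    return dense


events = [("play", 0.0, 100.0), ("pause", 2.0, 102.0), ("play", 2.0, 105.0)]
for stamp, position in interpolate_pointers(events, interval=1.0):
    print(stamp, position)
```

A smaller interval gives finer links into the audio at the cost of
a larger index; the interpolated positions are only as accurate as
the assumed speeds.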

Event Pointer Recording Unit 81 

The Event Pointer Recording Unit 81 gets data pairs from
the Event-Triggered Data Pointer Generators 75, time-stamps
each pair, filters the events and interpolates the data
pointers as needed, then sorts the events according to
chronological order, and records the events that are relevant
to the linking between data streams. Events that can be
deduced from other recorded events do not have to be
recorded. Linking between user input data events and the
streaming data events is implicit and based on proximity in
time. This linking has a fairly coarse resolution, within a few
seconds, depending on the reaction speed of the user and/or
the user input application 85. For the case of speech
recording or playback, the time alignment can be further
refined through the use of speech recognition as discussed
previously. Another way to refine the speech alignment is to
back up the user input event index to the beginning of a
sentence by detecting pauses in the recorded audio.
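
Linking by proximity in time, followed by backing up to a pause, may be sketched as below. The function names and sample values are hypothetical; the pause positions are assumed to have been detected already by a separate acoustic step that is not shown, and the list of streaming events is assumed to be non-empty.

```python
from bisect import bisect_right
from typing import List, Tuple

# (time_stamp, data_pointer) pairs; all values are made up for illustration.
StreamEvent = Tuple[float, float]


def link_by_proximity(user_time: float, stream_events: List[StreamEvent]) -> float:
    """Return the data pointer of the streaming event closest in time."""
    ordered = sorted(stream_events)            # chronological order
    times = [t for t, _ in ordered]
    i = bisect_right(times, user_time)
    candidates = ordered[max(i - 1, 0):i + 1]  # neighbours on either side
    return min(candidates, key=lambda ev: abs(ev[0] - user_time))[1]


def back_up_to_pause(pointer: float, pause_ends: List[float]) -> float:
    """Move a pointer back to the end of the latest pause at or before it."""
    earlier = [p for p in pause_ends if p <= pointer]
    return max(earlier) if earlier else pointer


# Usage: a user input event time-stamped at 1.3 s.
events = [(2.0, 16000.0), (0.0, 0.0), (1.0, 8000.0)]
pointer = link_by_proximity(1.3, events)
sentence_start = back_up_to_pause(pointer, [0.0, 6500.0, 12000.0])
```

Here the user event is linked to the streaming event at 1.0 s, and the link is then moved back to the pause ending at sample 6500, taken as the start of the sentence.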

Implementation

One embodiment of the time-event tracking system 23
(the architecture) of FIG. 5 is a software library with an
associated API that would allow any PC or handheld
computer to function as an annotation device. Using the API,
the user interfaces to any standard application such as
Microsoft Word as well as to video capture cards. The API also
defines an interface to search technology such as AltaVista
Personal Edition that would make it easy to search a set of
recorded annotations. This configuration may be viewed as
extending the utility of standard retrieval technology by
providing a means for a user to link searchable attributes to
a media stream that would otherwise remain inaccessible to
indexing techniques. Using this configuration of time-event
tracker system 23, many specialized systems for tasks such as
court recording could be replaced by PC systems running a
general purpose software library with an off-the-shelf
capture card.

For example, one data stream might consist of a series of
words typed into a document editor such as Microsoft
Word97. These words would comprise a set of notes for a
presentation, captured in real-time during the talk. A second
data stream might be an audio and video recording of the
presentation. Using the system 23 of FIG. 5, it is possible to
easily establish the correlation between words in the notes
and audio/video samples within the multimedia data stream.

For another example, this system 23 may be used to
automatically link user-entered meeting notes with the
relevant sections in an audio recording of that meeting. The
system 23 may also be used to get a rough alignment
between the transcript and the audio or video recording
(described above). In education and surveillance
applications, a user may attach their own searchable
annotations to course materials and captured footage. Many of
these applications can gain considerable leverage from the
ability to rapidly search the user input which has been
associated with a media stream using standard indexing and
retrieval technology.
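
Searching user input that has been linked to a media stream can be illustrated with a simple inverted index. This is a sketch only: `build_index` and `search` are hypothetical names, the notes and offsets are made up, and a deployed system would use an existing indexing and retrieval product rather than this dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(notes: List[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Map each lower-cased word to the media offsets it is linked to."""
    index: Dict[str, List[float]] = defaultdict(list)
    for word, offset in notes:
        index[word.lower()].append(offset)
    return index


def search(index: Dict[str, List[float]], word: str) -> List[float]:
    """Return the media offsets (in seconds) linked to a word, if any."""
    return index.get(word.lower(), [])


# Usage with made-up notes: (typed word, linked offset in seconds).
notes = [("Budget", 12.5), ("schedule", 40.0), ("budget", 95.2)]
index = build_index(notes)
hits = search(index, "BUDGET")
```

A search on a word returns every offset in the recording at which that word was typed, so playback can start at the relevant section.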

The Why's

The disclosed time-event tracking system 23 links time- 
correlated data pointers generated by trigger events, instead 
of merely a description or identification of the trigger events. 
This approach accommodates the asynchronous nature of 
some media sources. It allows an arbitrary playback speed 
with a possibly non-streaming, noncontinuous presentation 
clock. For example, in the closed captioning system 11 (FIG. 
1) disclosed above, the audio playback rate could be sped up 
or slowed down to match the transcriber's typing rate. The 
audio playback device can also be paused, rewound or fast 
forwarded during the transcription process. Because the 
Event-Triggered Data Pointer Generator 75b provides a
direct pointer (e.g., the tape counter or a recorded time stamp 
as opposed to an event time stamp) to the data being played 
during a trigger event, the transcription text 79 is linked to 
the appropriate audio 89. Another part of the system 11 then 
fine tunes the audio link to match the typed words with the
spoken words as discussed previously. Because data linking 
is based on a recorded clock instead of a presentation clock, 
the audio clock does not necessarily have to be the master 
clock of the system 11, unlike most other systems. 
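
The difference between a direct data pointer and an event time stamp can be shown with a small worked example. The sketch below assumes a hypothetical list of (wall-clock time, playback rate) commands and computes the media position that a device counter would report; `media_position` is an illustrative name and not part of the disclosed system.

```python
from typing import List, Tuple


def media_position(commands: List[Tuple[float, float]], wall_time: float) -> float:
    """Media position (s) at wall_time, given (wall_time, rate) commands.

    A rate of 1.0 is normal speed, 0.0 is pause, and a negative rate is rewind.
    Commands must be sorted by wall-clock time; playback starts at position 0.
    """
    position = 0.0
    for (start, rate), nxt in zip(commands, commands[1:] + [(wall_time, 0.0)]):
        end = min(nxt[0], wall_time)
        if end > start:
            position += rate * (end - start)
    return position


# Usage: play at 0.5x for 10 s, pause for 5 s, then play at normal speed.
commands = [(0.0, 0.5), (10.0, 0.0), (15.0, 1.0)]
pointer = media_position(commands, 20.0)
```

A keystroke typed 20 seconds into the session corresponds to a media position of 10 seconds, which is why the link is made to the recorded position rather than to the time of the event.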

This approach also enables the synchronization of user 
input to media playback/recording devices that are not 
tightly integrated into a unified system. It is well suited to 
systems that have little or no control of an external, asyn- 
chronous media stream. The Event Pointer Recording Unit 
81 only needs trigger event notifications from an external 
source in a timely fashion. The notifications must include a 
pointer to the appropriate point in the data when the trigger 
event occurred. For example, an external videotape player 
would have to provide its tape counter whenever a trigger 
event (any VCR command) occurred. 

This approach makes the order of creation of data streams
unimportant. For example, when used in a text transcription 
system, the system 23 allows a user to create a transcript
before, during or after an audio recording. If a transcript is 
created before an audio recording, there must be a mecha- 
nism (e.g., user input) to generate trigger events to the text 
during the subsequent recording. 

Once the trigger events are defined, they are automatically 
detected and linked without user intervention. 

The disclosed system 23 is useful and efficient for the 
following reasons: 

1. The system 23 is mostly independent of the user input 
application. No modification to the application is 
required to provide for the generation of observable 
trigger events. 

2. The trigger-event detection scheme is very general and
adaptable. For example, if a word processing program
is used for user input, the auto-save or auto-recovery
mode could be set for the most frequent updates. The
trigger event can then be the auto-save action. Higher
resolution is possible with the use of a keyboard/mouse
gateway application.

3. The linking operation is automatic. Once the trigger
events are defined, operation of the time-event tracking
system 23 is transparent to the user.
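
Using an auto-save action as the trigger event, as mentioned in item 2, could be approximated by watching the modification time of the saved file. The sketch below is one possible arrangement under that assumption; `AutoSaveWatcher` and `get_data_pointer` are hypothetical names, and the file and pointer value are stand-ins.

```python
import os
import tempfile
from typing import Callable, List, Optional, Tuple


class AutoSaveWatcher:
    """Treat each change of a file's modification time as a trigger event."""

    def __init__(self, path: str, get_data_pointer: Callable[[], float]) -> None:
        self._path = path
        self._get_data_pointer = get_data_pointer
        self._last_mtime: Optional[int] = None
        self.events: List[Tuple[int, float]] = []  # (mtime_ns, data pointer)

    def poll(self) -> bool:
        """Check the file once; record an event if it was saved since last poll."""
        mtime = os.stat(self._path).st_mtime_ns
        changed = self._last_mtime is not None and mtime != self._last_mtime
        if changed:
            self.events.append((mtime, self._get_data_pointer()))
        self._last_mtime = mtime
        return changed


# Usage: a temporary file stands in for the word processor's document.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
watcher = AutoSaveWatcher(path, lambda: 42.0)
first = watcher.poll()                              # baseline only, no event
os.utime(path, ns=(1_000_000_000, 2_000_000_000))   # simulate an auto-save
second = watcher.poll()
os.remove(path)
```

The user input application needs no modification in this arrangement; the resolution of the links is limited by the auto-save interval and the polling rate.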

What is claimed is: 

1. A method for producing time-aligned transcripts of an
audio track, comprising the steps of:

in response to an input audio stream, determining spoken
parts of the audio stream;

transcribing the determined spoken parts of the audio
stream by using an audio rate control routine, said
transcribing producing transcription text;

adding time marks to the transcription text by detecting
trigger events based on time of event keystrokes by an
operator performing the transcribing; and

re-aligning precisely the transcription text on the input
audio stream.

2. A method as claimed in claim 1 further comprising the 
step of segmenting the realigned transcription text into 
closed captions.

3. A method as claimed in claim 2 further comprising the 
step of generating non-speech captions from parts of the 
input audio stream which were determined to be other than 
the spoken parts. 

4. A method as claimed in claim 2 wherein the segmenting
includes:

detecting pauses acoustically;

determining from the detected pauses potential end of
sentences; and

accounting for natural language constraints in the
potential end of sentences to determine language legitimate
end of sentences, said segmenting being based on the
determined legitimate end of sentences.

5. A method as claimed in claim 1 wherein the audio rate
control routine includes:

counting speech units in the spoken parts and producing
a count of speech units for a given unit of time;

estimating a speech rate from the count of speech units; and

using the estimated speech rate, controlling playback of
the spoken parts of the audio stream to match a target
rate.

6. A method as claimed in claim 5 wherein the target rate
is about equal to rate of operator operating a keyboard to
effect the transcribing.

7. A method as claimed in claim 1 wherein the steps of 
determining spoken parts, adding time marks and realigning 
are performed in a digital processor; and 

the audio rate control routine is performed by a digital
processor. 

8. A method as claimed in claim 1 wherein the realigning 
produces time marks correlating character strings of the 
transcription text to corresponding parts of the input audio 
stream; and

further comprising the step of using the produced time 
marks for indexing respective position in time of the 
audio stream of various character strings in the tran- 
scription text, the indexing enabling a search on a 
desired character string to produce location in the audio
stream where the corresponding audio part for the 
desired character string exists. 

9. Apparatus for producing time aligned transcripts of an
audio track comprising:

an audio classifier, in response to an input audio stream,
the audio classifier determining spoken parts of the
audio stream;

audio rate controller coupled to receive the determined 
spoken parts of the audio stream from the audio 
classifier, the audio rate controller controlling rate of 
playback of the determined spoken parts of the audio 
stream to a transcriber transcribing the determined 
spoken parts and producing transcription text; 

a time event tracker for adding time marks to the tran- 
scription text by detecting trigger events based on time 
of event keystrokes by the transcriber performing the 
transcribing; 

a realigner responsive to output by the time event tracker, 
for precisely realigning the transcription text on the 
input audio stream; and 

a segmenter coupled to receive from the realigner the 
realigned transcription text, the segmenter segmenting 
the realigned transcription text to form closed captions. 

10. Apparatus as claimed in claim 9 wherein the audio 
classifier further generates non-speech captions from parts 
of the input audio stream determined to be other than the 
spoken parts. 

11. Apparatus as claimed in claim 9 wherein the audio rate 
controller: 

counts speech units in the spoken parts and produces a
count of speech units for a given unit of time;

estimates a speech rate from the counted speech unit; and

using the estimated speech rate controls playback of the
spoken parts of the audio stream to match a target rate.

12. Apparatus as claimed in claim 11 wherein the target 
rate is about equal to rate of transcriber operating a keyboard 
to effect the transcribing. 

13. Apparatus as claimed in claim 9 wherein the audio 
classifier, audio rate controller, time event tracker, realigner 
and segmenter are executed in a digital processor. 

14. Apparatus as claimed in claim 9 wherein the realigner 
produces time marks correlating character strings of the 
transcription text to corresponding parts of the input audio 
stream; and 

the apparatus further comprises an indexer, said indexer 
using the produced time marks to index respective 
position in time of the audio stream of various character 
strings in the transcription text, such that in response to 
a search on a desired character string, the indexer 
produces location in the audio stream where the cor- 
responding audio part for the desired character string 
exists. 

15. Apparatus as claimed in claim 9 wherein the seg- 
menter further: 

detects pauses acoustically; 

determines from the detected pauses potential ends of 
sentences; and 

accounts for natural language constraints in the potential 
ends of sentences to determine legitimate end of 
sentences, said segmenter segmenting the realigned 
transcription text according to the determined legiti- 
mate end of sentences, to form closed captions. 
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